Library¶
In [ ]:
import pandas as pd
import numpy as np
from bertopic import BERTopic
import sys
sys.path.append('../../Util')
import ShowGraphs as sg
3 Baseline Summary Content¶
Clustering Approach¶
- Parameter Setting:
- Embedding Model: all-MiniLM-L6-v2
- Representation Model: KeyBERTInspired, MaximalMarginalRelevance
- CountVectorizer
- c-TF-IDF
- UMAP: 20 neighbors, 8 components
- HDBSCAN: min cluster size 190
- Zero-shot classification on cluster names
- Outlier reduction with probabilities: 0.03 threshold
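The settings above can be sketched as a BERTopic configuration. This is a minimal sketch, not the notebook's actual training code: only the values listed above come from this document, every other argument is left at its library default, and the names `PARAMS` and `build_topic_model` are hypothetical. `calculate_probabilities=True` and `prediction_data=True` are assumptions, made because the outlier reduction relies on probabilities. The zero-shot labelling step is omitted.

```python
# Values taken from the parameter list above; everything else stays at library defaults.
PARAMS = {
    "embedding_model": "all-MiniLM-L6-v2",
    "umap": {"n_neighbors": 20, "n_components": 8},
    "hdbscan": {"min_cluster_size": 190},
    "outlier_threshold": 0.03,
}


def build_topic_model(params=PARAMS):
    """Assemble a BERTopic model from the listed settings (hypothetical helper)."""
    # Imported lazily so the parameter dictionary can be inspected
    # without loading the heavy dependencies.
    from bertopic import BERTopic
    from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance
    from bertopic.vectorizers import ClassTfidfTransformer
    from hdbscan import HDBSCAN
    from sklearn.feature_extraction.text import CountVectorizer
    from umap import UMAP

    return BERTopic(
        embedding_model=params["embedding_model"],
        umap_model=UMAP(**params["umap"]),
        # Assumption: prediction_data=True, so HDBSCAN can supply membership probabilities
        hdbscan_model=HDBSCAN(prediction_data=True, **params["hdbscan"]),
        vectorizer_model=CountVectorizer(),
        ctfidf_model=ClassTfidfTransformer(),
        # A list chains the representation models in the given order
        representation_model=[KeyBERTInspired(), MaximalMarginalRelevance()],
        # Assumption: full topic probabilities are needed for the outlier reduction
        calculate_probabilities=True,
    )
```

After `topics, probs = topic_model.fit_transform(docs)`, the outlier step would then look like `topic_model.reduce_outliers(docs, topics, probabilities=probs, strategy="probabilities", threshold=PARAMS["outlier_threshold"])`, where `docs` stands for the corpus, which is not shown here.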
Clustering Results¶
Clusters retrieved: 121. The most important ones concern:
- Drug sales (marijuana, cocaine, xanax, pills, meth, fentanyl,
- Bitcoin
- Scammers and seller reviews
- Marketplace advertising
- Purchase reviews
- Drug purchases
- Orders
- Closed sites (empire market,
- Scams
- Sold passwords
- Hacker attacks
- Opsec questions
- Document and credit card forgery
- Chat links
Performance Metrics:
- Silhouette Score: 0.60
- Davies-Bouldin Score: 0.46
- Coherence Score: 0.69
- Dos Score: 0.24
- %Outliers: 0.35 (91k/260k)
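The outlier figure above is a simple ratio of counts. As a minimal sketch, assuming outliers are the documents that BERTopic assigns to topic -1 (the helper name `outlier_share` is hypothetical):

```python
def outlier_share(topics):
    """Return the fraction of documents assigned to the outlier topic (-1)."""
    if not topics:
        return 0.0
    return sum(1 for t in topics if t == -1) / len(topics)


# Toy assignment: two of five documents are outliers
print(outlier_share([-1, 0, 3, -1, 7]))  # 0.4

# The rounded counts reported above: about 91k outliers out of 260k documents
print(round(91_000 / 260_000, 2))  # 0.35
```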
Code¶
To avoid importing the whole BERTopicUtils.py module, and with it all of its dependencies, only the function needed to make predictions is copied here. It is the same function as in the module.
In [3]:
def predict_topic(topic_model: BERTopic, sentence: list, num_classes: int = 5, custom_labels: bool = False) -> pd.DataFrame:
    """
    Predict the most likely topics for a sentence using the BERTopic model.
    :param topic_model: The BERTopic model.
    :param sentence: A one-element list containing the sentence to classify.
    :param num_classes: The number of top topics to return.
    :param custom_labels: Whether to use custom labels instead of generated ones.
    :return: A DataFrame with the predicted topics.
    """
    # Transform the sentence; pr[0] holds the per-topic probabilities for it
    _, pr = topic_model.transform(sentence)
    # Get the indices of the num_classes most probable topics
    top_indices = np.argsort(pr[0])[::-1][:num_classes]
    # Fetch the labels once instead of regenerating them for every topic
    labels = topic_model.custom_labels_ if custom_labels else topic_model.generate_topic_labels()
    # Labels are indexed with i + 1, presumably because index 0 belongs to the outlier topic (-1)
    top_topics = [(topic_model.get_topic(i), pr[0][i], labels[i + 1]) for i in top_indices]
    # Create a DataFrame with the results
    df_finals = pd.DataFrame(top_topics, columns=['Topic', 'Probability', 'Label'])
    # Extract the words and repeat the input sentence on every row
    df_finals['Words'] = df_finals['Topic'].apply(lambda topic: [word for word, _ in topic])
    df_finals['Sentence'] = sentence * len(df_finals)
    return df_finals
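The selection step inside `predict_topic` can be illustrated without a trained model. The sketch below uses plain Python in place of `np.argsort` and made-up probabilities and labels. The helper `top_k_topics` is hypothetical, and the label list is assumed to start with the outlier label, which is why topic `i` maps to `labels[i + 1]`.

```python
def top_k_topics(probabilities, labels, k=5):
    """Return (topic_id, probability, label) for the k most probable topics.

    `labels` is assumed to start with the outlier label, so topic i maps to labels[i + 1].
    """
    # Sort topic indices by descending probability (same idea as np.argsort(...)[::-1])
    order = sorted(range(len(probabilities)), key=lambda i: probabilities[i], reverse=True)
    return [(i, probabilities[i], labels[i + 1]) for i in order[:k]]


# Made-up probabilities and labels, for illustration only
probs = [0.10, 0.60, 0.05, 0.25]
labels = ["-1_outliers", "0_weed", "1_deposit", "2_pgp", "3_thanks"]
print(top_k_topics(probs, labels, k=2))
# [(1, 0.6, '1_deposit'), (3, 0.25, '3_thanks')]
```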
Visualize CSV Files¶
In [5]:
topic_model = BERTopic.load("../../Analyze_files/CombiningAnalysisCompleteDataset/ContentAnalysis/ModelsContent/topic_model_all-MiniLM-L6-v2_190_20n_8dim", embedding_model='all-MiniLM-L6-v2')
In [2]:
descr_topic = pd.read_csv('CSV121Topic/description_topic.csv')
document_topic = pd.read_parquet('CSV121Topic/document_topic_proba.parquet')
topics_over_time = pd.read_csv('CSV121Topic/topic_over_time_5.csv')
Topic Description¶
In [67]:
print(descr_topic.shape[0])
descr_topic.head()
121
Out[67]:
| | Topic | Count | BERTopic_Name | Representation | Representative_Docs | Custom_Name_GenAI | Custom_Name_Zero_Shot |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 14083 | 0_cart_weed_strain_thc | ['cart' 'weed' 'strain' 'thc' 'bud' 'price' 'p... | ['general review template general information ... | Product Reviews and Purchases | weed - thc - cart |
| 1 | 2 | 5696 | 2_key_pgp_account_pgp key | ['key' 'pgp' 'account' 'pgp key' 'password' 'm... | ['ordered item dream attempt send address vend... | PGP Key Security | pgp key |
| 2 | 1 | 5365 | 1_deposit_address_ticket_btc | ['deposit' 'address' 'ticket' 'btc' 'wallet' '... | ['missing two deposit week ago big deposit err... | Empire Deposit & Withdrawal Issues | ticket - deposit - address |
| 3 | 5 | 4584 | 5_thanks_thank_lol_man | ['thanks' 'thank' 'lol' 'man' 'bro' 'good' 'ni... | ['love man thanks work brother' 'damn thats fu... | Friendly Positive Talk | thanks |
| 4 | 25 | 3778 | 25_mg_pill_tablet_price | ['mg' 'pill' 'tablet' 'price' 'xtc' 'pharma' '... | ['hey empire customer back short vacation gene... | Drug Sales | xtc - mg - pill |
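The `Representation` and `Representative_Docs` cells above look like the string form of a NumPy array (quoted items separated by spaces, no commas), which is what a CSV round trip typically produces. Assuming that is the case, a small helper can turn such a cell back into a list of strings. The name `parse_array_string` is hypothetical, and the sketch does not handle items that contain both quote characters.

```python
import re


def parse_array_string(cell):
    """Split a string such as "['key' 'pgp' 'pgp key']" into its quoted items."""
    # Match single- or double-quoted items; exactly one of the two groups is filled per match
    return [a or b for a, b in re.findall(r"'([^']*)'|\"([^\"]*)\"", cell)]


print(parse_array_string("['key' 'pgp' 'account' 'pgp key']"))
# ['key', 'pgp', 'account', 'pgp key']
```

It could be applied column-wise with `descr_topic['Representation'].apply(parse_array_string)`.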
In [8]:
sg.plot_topic_distribution(descr_topic, figsize=(12, 20))
In [9]:
sg.plot_topic_percentage_distribution(descr_topic, figsize=(12, 6))
In [ ]:
sg.create_wordclouds(topic_model, num_topics=121, cols=4, width=800, height=700)
Document Topic Description¶
In [68]:
print(document_topic.shape[0])
document_topic.head(5)
169288
Out[68]:
| | Document | Topic | Probability | Created_on | BERTopic_Name |
|---|---|---|---|---|---|
| 0 | finally got dream lilxan account confirmed pgp... | 9 | [0.0019145853509105579, 0.0034859457117316834,... | 2019-10-16 | 9_pgp_begin pgp_begin_pgp signature |
| 1 | im issues vendor account issues withdrawing cm... | 1 | [0.004161525996584431, 0.09864712150395494, 0.... | 2019-10-30 | 1_deposit_address_ticket_btc |
| 2 | making switch xmr besides xmr hodler favorite ... | 15 | [0.002285542372071366, 0.007466206317565759, 0... | 2019-10-16 | 15_monero_xmr_wallet_btc |
| 3 | got free cooky cart order cannacreations one e... | 0 | [0.2930367581669767, 0.0036731910177007464, 0.... | 2019-10-16 | 0_cart_weed_strain_thc |
| 4 | bg gone either oc look like he even cheaper oc... | 38 | [0.005669327905041996, 0.0035937638895178897, ... | 2019-10-16 | 38_pack_week_day_ordered |
In [17]:
document_topic['Max_Probability'] = document_topic['Probability'].apply(np.max)
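`Max_Probability` stores the largest value of each document's probability vector. A companion check is whether the position of that maximum matches the assigned `Topic`, which assumes that position `i` in the vector corresponds to topic `i`. The sketch below uses a hypothetical helper, `max_with_index`, on a toy vector.

```python
def max_with_index(probabilities):
    """Return (index, value) of the largest entry in a probability vector."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return best, probabilities[best]


# Toy vector, for illustration only
print(max_with_index([0.29, 0.004, 0.01]))  # (0, 0.29)
```

The helper accepts any indexable sequence, so it could also be passed to `document_topic['Probability'].apply(...)`.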
In [31]:
sg.plot_avg_prob_or_freq(document_topic, 'Max_Probability', figsize=(25, 15))
In [30]:
sg.plot_boxplot(document_topic, 'Max_Probability', figsize=(25, 15))
In [32]:
sg.plot_probability_distribution(document_topic, 'Max_Probability')
In [ ]:
sg.create_wordclouds(document_topic, num_topics=121, cols=4, is_model=False, width=800, height=700)